Creators/Authors contains: "HasanzadeZonuzy, Aria"

  1. Safe reinforcement learning is extremely challenging: not only must the agent explore an unknown environment, but it must also do so while ensuring no safety constraint violations. We formulate this safe reinforcement learning (RL) problem using the framework of a finite-horizon Constrained Markov Decision Process (CMDP) with an unknown transition probability function, where we model the safety requirements as constraints on the expected cumulative costs that must be satisfied during all episodes of learning. We propose a model-based safe RL algorithm that we call Doubly Optimistic and Pessimistic Exploration (DOPE), and show that it achieves an objective regret $\tilde{O}(|\mathcal{S}|\sqrt{|\mathcal{A}| K})$ without violating the safety constraints during learning, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, and $K$ is the number of learning episodes. Our key idea is to combine a reward bonus for exploration (optimism) with a conservative constraint (pessimism), in addition to the standard optimistic model-based exploration. DOPE not only improves the objective regret bound, but also shows a significant empirical performance improvement compared to earlier optimism-pessimism approaches.
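     As a rough illustration of the optimism-pessimism combination described above (a minimal sketch, not the authors' DOPE algorithm; the bonus form and all names here are assumptions), the empirical reward can be inflated by an exploration bonus while the constraint cost is inflated by the same bonus, so the planner explores aggressively on the objective while staying conservative on safety:

         import numpy as np

         def confidence_bonus(counts, horizon, delta=0.05):
             # Illustrative Hoeffding-style bonus; shrinks as (s, a) visit counts grow.
             return horizon * np.sqrt(np.log(2.0 / delta) / np.maximum(counts, 1))

         def optimistic_pessimistic_estimates(reward_hat, cost_hat, counts, horizon):
             # reward_hat, cost_hat, counts: arrays of shape (S, A) built from observed data.
             bonus = confidence_bonus(counts, horizon)
             reward_opt = reward_hat + bonus   # optimism on the objective
             cost_pess = cost_hat + bonus      # pessimism on the safety cost
             return reward_opt, cost_pess

     Planning each episode against (reward_opt, cost_pess) together with an optimistic transition model biases the learned policy away from the constraint boundary; the constrained planner and the transition confidence set are omitted from this sketch.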
  2. In many real-world reinforcement learning (RL) problems, in addition to maximizing the objective, the learning agent has to maintain some necessary safety constraints. We formulate the problem of learning a safe policy as an infinite-horizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two model-based constrained reinforcement learning (CRL) algorithms for learning a safe policy, namely, (i) the GM-CRL algorithm, where the algorithm has access to a generative model, and (ii) the UC-CRL algorithm, where the algorithm learns the model using an upper-confidence-style online exploration method. We characterize the sample complexity of these algorithms, i.e., the number of samples needed to ensure a desired level of accuracy with high probability, both with respect to objective maximization and constraint satisfaction.
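     A minimal sketch of the generative-model access that the GM-CRL setting assumes (the sampler interface, the per-pair budget n, and all names are illustrative): each state-action pair is queried a fixed number of times to build an empirical transition model, which is then handed to a constrained planner.

         import numpy as np

         def estimate_transitions(sample_next_state, num_states, num_actions, n=100):
             # sample_next_state(s, a) is assumed to draw one next state from the
             # unknown CMDP; p_hat[s, a] becomes the empirical next-state distribution.
             p_hat = np.zeros((num_states, num_actions, num_states))
             for s in range(num_states):
                 for a in range(num_actions):
                     for _ in range(n):
                         p_hat[s, a, sample_next_state(s, a)] += 1.0
                     p_hat[s, a] /= n
             return p_hat

     The sample-complexity question is then how large n must be so that a policy that is optimal and feasible for the estimated CMDP remains near-optimal and near-feasible for the true one with high probability.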
  3. In many real-world reinforcement learning (RL) problems, in addition to maximizing the objective, the learning agent has to maintain some necessary safety constraints. We formulate the problem of learning a safe policy as an infinite-horizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two model-based constrained reinforcement learning (CRL) algorithms for learning a safe policy, namely, (i) the GM-CRL algorithm, where the algorithm has access to a generative model, and (ii) the UC-CRL algorithm, where the algorithm learns the model using an upper-confidence-style online exploration method. We characterize the sample complexity of these algorithms, i.e., the number of samples needed to ensure a desired level of accuracy with high probability, both with respect to objective maximization and constraint satisfaction.
  4. Many physical systems have underlying safety considerations that require that the policy employed ensures the satisfaction of a set of constraints. The analytical formulation usually takes the form of a Constrained Markov Decision Process (CMDP). We focus on the case where the CMDP is unknown, and RL algorithms obtain samples to discover the model and compute an optimal constrained policy. Our goal is to characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy---both objective maximization and constraint satisfaction---in a PAC sense. We explore two classes of RL algorithms, namely, (i) a generative-model-based approach, wherein samples are taken initially to estimate a model, and (ii) an online approach, wherein the model is updated as samples are obtained. Our main finding is that, compared to the best known bounds of the unconstrained regime, the sample complexity of constrained RL algorithms is increased by a factor that is logarithmic in the number of constraints, which suggests that the approach may be easily utilized in real systems.
  5. Many physical systems have underlying safety considerations that require that the policy employed ensures the satisfaction of a set of constraints. The analytical formulation usually takes the form of a Constrained Markov Decision Process (CMDP). We focus on the case where the CMDP is unknown, and RL algorithms obtain samples to discover the model and compute an optimal constrained policy. Our goal is to characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy---both objective maximization and constraint satisfaction---in a PAC sense. We explore two classes of RL algorithms, namely, (i) a generative-model-based approach, wherein samples are taken initially to estimate a model, and (ii) an online approach, wherein the model is updated as samples are obtained. Our main finding is that, compared to the best known bounds of the unconstrained regime, the sample complexity of constrained RL algorithms is increased by a factor that is logarithmic in the number of constraints, which suggests that the approach may be easily utilized in real systems.
  6. We consider the problem of serving real-time flows over a multi-hop wireless network. Each flow is composed of packets with strict deadlines, and the goal is to maximize the weighted timely throughput of the system. Consistent with recent developments in mm-wave communications, we assume that the links are directional, but lossy, with unknown probabilities of successful packet transmission. An average link utilization budget (similar to a power constraint) constrains the system. We pose the problem as a Constrained Markov Decision Process (CMDP) with an unknown transition kernel. We use a duality approach to decompose the problem into an inner unconstrained MDP with link usage costs, and an outer link-cost update step. For the inner MDP, we develop model-based reinforcement learning algorithms that sample links by sending packets to learn the link statistics. While the first algorithm type samples links at will at the beginning and constructs the model, the second type is an online approach that can only use packets from flows to sample the links that they traverse. The outer problem is solved by gradient descent. We characterize the sample complexity (number of packets transmitted) needed to obtain near-optimal policies, and show that although a basic online approach has a poorer sample complexity bound, it can be modified to obtain an online algorithm with excellent empirical performance.
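     A minimal sketch of the duality-based decomposition described above (the function names and the update rule are illustrative assumptions, not the paper's exact procedure): the inner step solves an unconstrained MDP whose reward is the timely-throughput reward minus a price on link usage, and the outer step adjusts that price according to how far the resulting policy is from the utilization budget.

         def primal_dual_link_cost(solve_inner_mdp, avg_link_usage, budget,
                                   steps=50, step_size=0.1):
             # solve_inner_mdp(price): policy maximizing reward - price * link usage
             #                         (e.g., value iteration on a learned model).
             # avg_link_usage(policy): average link utilization under that policy.
             price = 0.0
             for _ in range(steps):
                 policy = solve_inner_mdp(price)              # inner unconstrained MDP
                 violation = avg_link_usage(policy) - budget  # constraint slack
                 price = max(0.0, price + step_size * violation)  # dual price update
             return policy, price

     The model-based learning component, in which links are sampled by sending packets to estimate their success probabilities, feeds the inner solver and is omitted from this sketch.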
  7. We study the problem of broadcasting real-time flows in multi-hop wireless networks. We assume that each packet has a stringent deadline, and each node in the network obtains some utility based on the number of packets delivered to it on time for each flow. We propose a distributed protocol called delegated-set routing (DSR) that incurs virtually no coordination overhead among nodes. We also develop distributed algorithms that aim to maximize the total timely utility under DSR. The utility of the DSR protocol and the distributed algorithms is demonstrated by both theoretical analysis and simulation results. We show that our algorithms achieve higher timely throughput even when compared against centralized throughput-optimal policies that do not consider deadline constraints.